A network-based approach to identifying correlations between phylogeny, morphological traits and occurrence of fish species in US river basins

The complex network framework has been successfully used to model interactions between entities in Complex Systems in the Biological Sciences such as Proteomics, Genomics, Neuroscience, and Ecology. Networks of organisms at different spatial scales and in different ecosystems have provided insights into community assembly patterns and emergent properties of ecological systems. In the present work, we investigate two questions pertaining to fish species assembly rules in US river basins, a) if morphologically similar fish species also tend to be phylogenetically closer, and b) to what extent are co-occurring species that are phylogenetically close also morphologically similar? For the first question, we construct a network of Hydrologic Unit Code 8 (HUC8) regions as nodes with interaction strengths (edges) governed by the number of common species. For each of the modules of this network, which are found to be geographically separated, there is differential yet significant evidence that phylogenetic distance predicts morphological distance. For the second question, we construct and analyze nearest neighbor directed networks of species based on their morphological distances and phylogenetic distances. Through module detection on these networks and comparing the module-level mean phylogenetic distance and mean morphological distance with the number of basins of common occurrence of species in modules, we find that both phylogeny and morphology of species have significant roles in governing species co-occurrence, i.e. phylogenetically and morphologically distant species tend to co-exist more. In addition, between the two quantities (morphological distance and phylogentic distance), we find that morphological distance is a stronger determinant of species co-occurrences.


Introduction
Functional traits and phylogenetic relatedness are important attributes of species in a community to assess the two critical facets of biodiversity i.e., functional and phylogenetic diversity. It is often assumed that the functional diversity of animal and plant species are concordant with their phylogenetic diversity [1][2][3]. Furthermore, for optimizing conservation strategies, it is crucial to understand whether conserving phylogenetic diversity will help in conserving functional diversity of species [4][5][6]. The relationship between functional traits and phylogenetic closeness has been widely explored in various taxa in the past decades. Recently, there have been many studies to better understand whether ecologically relevant functional traits are conserved along the phylogeny [7][8][9][10]. To do this, phylogenetic signal is typically addressed by testing if closely related species share more similar traits than expected by chance [11][12][13].
The functional and phylogenetic similarity between co-occurring species also plays a key role in the processes governing community assembly [14,15]. Understanding the determinants of species co-occurrence in communities is a fundamental question in ecology [16]; therefore a few studies [15,17] have focused on the structure of communities in species interaction networks as a way of gaining insights on coexistence mechanisms. There are two mutually exclusive mechanisms to illustrate the similarity of co-occurring species in a community. The competition mechanism [18][19][20] predicts that if a community is structured by competition, then species within the community should be more different functionally and phylogenetically than expected from assemblages that are comprised of random samples drawn from the potential species pool. In contrast, the environmental filtering mechanism predicts that if a particular type of habitat requires possession of certain adaptive traits, species within the communities should have more similar traits or be more closely related than expected by chance [21]. Ultimately, patterns of species co-occurrence may depend on how both biotic interactions and environmental filtering act over ecological and evolutionary time scales [22,23]. For instance, Winston (1995) [14] reported that functionally similar species co-occurred less frequently than more dissimilar species for 219 fish communities from the Red River basin in the US, supporting the biotic interactions mechanism. In contrast, Peres-Neto (2004) [21] reported that co-occurring species had greater functional and phylogenetic similarity than expected by chance for stream fish species in the Macau River basin in Brazil, and concluded that the co-occurrence patterns were mainly driven by the environmental filtering. As the organization of species into communities is a non-random process constrained by their interactions and system dynamics, it is important to identify such community assembly rules against the backdrop of species' evolutionary history. Hence, a framework is needed that incorporates phylogenetic and morphological relatedness of species to reveal their groups (communities). Beyond the group-level averages of morphological relatedness and phylogenetic relatedness of species, and number of basins of their co-occurrence, one can study the correlations between these three quantities. Hence, such a framework can help understand if morphological and phylogenetic closeness have differential impact on community assembly rules.
In this work, we complied occurrence data of native freshwater fishes from 2073 watersheds in the US. We measured 10 morphological traits that related to fish functions [24] and subsequently compiled phylogenetic information on these fish species [25]. We test the relatedness of species phylogeny with species morphology; and their relation with the species co-occurrences. For this, we used the concept of complex networks. The complex network framework not only allows the modeling of one-to-one interactions (relatedness) between entities of a complex system [26,27], but can also explain the emergent structural patterns that have implications for its function [28]. Ecological Networks, in general, are finding increasing applicability in the study of species interactions, spatial ecology, community assembly, and its relation with phylogeny [29][30][31][32]. Studies have used fish co-occurrence networks for identifying a spatial cluster of fish-subgroups [33], and phylogenetic networks to retrace species dispersal history [34] in freshwater fishes in Ontario, Canada. Layeghifard et al. (2015) [35] presented a multi-species spatial network framework to describe the relationship between the spatial positioning of sites and the patterns of patch occupation by species across multiple ecological communities. Their approach is useful in inferring dispersal effects on spatial heterogeneity within metacommunities. Li et al. (2021) [29] developed an optimal information flow model to infer causality of interactions in complex ecosystems. Their approach of inferring predictive ecosystem networks based on transfer entropy and functional covariance offers a method to predict biodiversity changes under climate and anthropogenic stressors, and hence has potential in fish preservation. Based on our continental-scale fish dataset, we seek to answer the following questions: 1) Is morphological closeness of species related to phylogenetic closeness (for HUC8 level river basins), and 2) does trait similarity of fish species together with phylogenetic closeness predict species co-occurrence in HUC8 basins, and to what degree. To answer these questions, it is essential to look at all the three attributes of species (phylogeny, morphology and co-occurrence) simultaneously. This is because the correlation between two of these factors, say phylogenetic closeness and co-occurrence, might be influenced by their mutual correlation to a third factor, in this case morphological closeness. Our work therefore builds on and extends earlier network-based studies that explored similar questions [36][37][38], but only focused on one or two of these attributes. In the present work we attempt to identify possible interdependence among all three attributes using the complex network framework and continental-scale fish species phylogenetic, morphological and cooccurrence data. The goal of these analyses is to more carefully analyze the interrelationships among these three quantities, and to identify which pairwise relationships are strongest. The two classes of networks we constructed from the datasets are 1) co-occurrence based basin network of HUC8 regions, and 2) nearest-neighbour directed networks of species based on the metrics of phylogenetic distances and morphological distances to other species in the system. Through module-extraction on these networks, which identifies groups of densely connected entities in the network, we quantify relationships among these quantities. In summary, our goal is to understand correlations between fish phylogeny, fish morphology and their extent of co-occurrence via a multi-attribute network-based analysis, wherein all the three quantities are simultaneously studied.
About the dataset 1. Species occurrence data: We compiled native species occurrence records at the watershed scale (i.e., Hydrologic Unit Code 8; HUC8) from NatureServe (https://www.natureserve. org) and included both extant and extinct species to account for species historically present in a given watershed. Records of non-native species occurrence were excluded from the entire study. Overall, the occurrence dataset has presence and absence information of 804 fish species in 2073 HUC8 regions. The NatureServe dataset provides comprehensive species lists per watersheds, and is commonly used as fish community data in the scientific literature [39][40][41][42]. Furthermore, as our aim is to disentangle the morphological and phylogenetic relationship between the species in the native communities under natural processes, non-native species, which were mostly introduced by humans, were removed from our analysis. For showing HUC8 subbasins on the US map, we use their centroid locations, obtained from the Watershed Boundary Dataset, United States Geological Survey (USGS, https://www.usgs.gov/national-hydrography/watershed-boundary-dataset).

Functional traits:
We compiled ten morphological traits related to fish locomotion and food acquisition from the FISHMORPH database [43,44], and FishBase [45]. The ten morphological traits include maximum body length (Length), body elongation (BlBd), relative eye size (EdHd), oral gap position (MoBd), relative maxillary length (JlHd), vertical eye position (EhBd), body laterally shape (HdBd), pectoral fin vertical position (PFlBl), pectoral fin size (PFiBd), and caudal peduncle throttling (CFdCP). Due to insufficient information on some species, some values were missing in the raw functional trait data. We statistically imputed these missing values [24] with a machine learning algorithm called #x2018;missForest#x2019; [46,47]. This method uses a random forest trained on the observed values of a data matrix to predict the missing values and automatically calibrates the filling values by a set of iterations [48]. The missforest algorithm is one of the best performing methods to statistically impute missing trait values [46,47] and has been used in many studies [24,[49][50][51]. Here, we used this method with all parameters set to default values to impute the 10 trait missing values, while accounting for taxonomic information. To improve the accuracy of the results, we log-transformed the values before putting them into the model, and then back transformed them after imputation.

Phylogenetic relatedness:
We obtained the phylogenetic information of these fish species from Rabosky et al. (2018) [25], which includes 31, 526 marine and freshwater ray-finned fishes. This dataset is based on 11, 638 species whose position was estimated from genetic data; the remaining 19, 888 species were placed in the tree using stochastic polytomy resolution. We pruned the tree and kept only the 804 native freshwater fish species used in our study. Then we computed the pairwise distances between the pairs of these species from the pruned phylogenetic tree by using the 'cophenetic.phylo' function from the R package "ape" [52] (version 5.6-2) with all parameters were set to default values.

Methodology
We use the phylogenetic distance (PD) dataset to measure phylogenetic closeness and farness between the fish species. In the trait dataset, 10 morphological traits of all species are normalized (by division by maximum value of the trait) to avoid the impact of their magnitudes on the analysis. Further, we assume them as vectors in a 10-dimensional trait space, and visualize each species as points in this vector space. For assessing the trait resemblances of species from the morphological traits dataset, we needed to define a measure and therefore we borrow the concept of cosine-similarity (CS). Hence, for measuring the trait-similarity between species, we use the CS [53][54][55] measure. The CS between trait vectors is the angular distance (θ) between the species in the 10-dimensional feature space, measured as the cosine of θ. For any given two species (say S 1 and S 2 ), the morphological similarity between them is measured using the definition: where T 1 and T 2 are trait vectors of species, and n = 10, as there are 10 traits in total. Hence, unlike the Euclidean distance, it is independent of the vector magnitudes, with CS = 1 implying that the vectors are co-linear or that the species are similar and CS = 0 that the vectors are orthogonal or that the species are not similar in their traits. However, because we intend to correlate trait similarity with phylogeny, which is conventionally expressed as phylogenetic distance (PD), we use cosine-distance (CD) defined as CD = 1 − CS, instead of CS, and call it morphological distance between species throughout our analyses for consistency.

Complex network framework
First category of networks that we study is the basin network. In the basin network, the nodes are 2073 HUC8 regions and edges depict the number of fish species these basins share. Hence, absence of an edge would depict no common species between the basins. We demonstrate the construction of basin network in Fig 2(a) for eight basins each having some species. This network is weighted and un-directed by nature, where edge-weights are the number of species the two basins in question, share between them. We perform module-detection on the basin network to identify groups of basins with higher co-occurrences and their geographical distribution. For correlating co-occurrence information with phylogenetic closeness and trait similarity, we obtain basin-wise means of PD (hPDi) and CD (hCDi) of species within each basin. Further, to infer dependence between morphological and phylogenetic distances, we look for association between these quantities by performing linear regression on the data of each cluster and obtaining R 2 and corresponding p-value to ascertain the quality of the fit. To explore characteristic ranges of hPDi and hCDi in these clusters we obtain clusters of randomly selected basins (from all the 2073 HUC8 regions), of equal size to the actual clusters. The ranges of hPDi and hCDi in random clusters is then compared with the actual cluster ranges.
Once we have explored the relationship between phylogenetic closeness and trait similarity from the basin network analyses, the next two questions we ask are: if these quantities correlate or explain variances in number of basins of co-occurrences? And, which of these quantities is a more fundamental or a stronger determinant of species co-occurrence? To explore these questions we look at the species networks, which is the second cataegory of the networks we construct and analyse.
Studying species interactions in a network framework allows us to gauge the overall connection structure in a single snapshot. To this end, we construct directed species networks wherein an edge from a node (species) to another node means that the latter is the nearest neighbor of the former. There are two networks like this, one of nearest neighbors based on phylogenetic distance, and one of nearest neighbors based on morphological distance. For the first network, for example, if the two species are phylogenetically closer to each other, i.e. they have small PD between them, they are connected via an edge, and similarly, other node pairs in the network are assigned their connections. The number of connections beginning from all source nodes are fixed beforehand and each of them have those many outgoing connections. This results in directed networks wherein edges capture species-specific interactions or the asymmetry of interactions. For example, if species P is among the top nearest phylogentic neighbours of species Q, it does not imply that species Q is among the top nearest phylogentic neighbours of species P. Hence, if a directed edge exist from Q to P, it may not exist from P to Q. In this manner, we identify a fixed number of nearest neighbors (NN) nodes for each node in the network and obtain directed networks of fish species. The procedure for constructing species networks based on a pre-defined number of NN is explained in Fig 1(b). The matrix defining the connections in the network is called the adjacency matrix (A), where A ij = 1 denotes the presence of connection and A ij = 0 denotes the absence of the connection between nodes i and j. As described, these are directed networks (identified by the presence of directed links between the nodes as shown in Fig 1(b)), as each of the species selects its NNs depending on the value of its PD or CD to target species. The two kinds of species' networks, constructed and analyzed in this work are: • PD based Network: Each species in the network connects to NN other species that are phylogenetically closest to itself.
• CD based Network: Each species connects to NN other species that are trait-wise most similar to itself.
Hence, for PD based species network, we connect all the 804 species to their NN phylogenetic nearest neighbours, and for CD based species network, we connect them to their NN trait-wise nearest neighbours. Further, for both categories of these species networks we identify modules (or groups) of species using a network community-detection algorithm and then explore co-occurrence of the species within modules, in the HUC8 regions. For species within each of these modules, we obtain mean over pair-wise PD (hPDi) and mean over pair-wise CD (hCDi). Following this, we obtain the number of HUC8 regions in which at least two of the species within the modules occur together. Hence, each of the species modules is assigned three quantities: hCDi, hPDi and the number of basins of co-occurrence of at least two species. We denote the last quantity as NBS in the upcoming text.
As stated earlier, our goal is to ascertain which of these metrics, actually and significantly plays a role in community assembly rules. Hence, we define two null hypotheses: H 1 and H 2 as follows.
1. H 1 : The number of basins of co-occurrence of species are not related to their trait similarity.
2. H 2 : The number of basins of co-occurrence of species are not related to their phylogenetic closeness. The panel on the right shows an example network of five species, with each node connecting to two of its NNs. This network (main) is formed by aggregation of different sub-networks. The panel on the left shows these sub-networks; first sub-network shows species S 1 connecting to S 2 and S 3 , second shows S 2 connecting to S 1 and S 4 , and so on. Alongside each of these sub-networks and the main network is shown an adjacency matrix, where 1 (0) in a cell indicated the presence (absence) of connection between the nodes. The network on the right shows all the connections in the sub-networks on the left, and its adjacency matrix is the sum of all the sub-network adjacency matrices. The solid edges are the ones explicitly shown in the sub-networks, and the dashed edges are edges from subnetworks whose display has been skipped on the left. https://doi.org/10.1371/journal.pone.0287482.g001

PLOS ONE
Identifying correlations between phylogeny, morphological traits and occurrence of fish species Notice that the averages (denoted by hxi) in the species network depict the mean of pairwise x over species in a given module, whereas in the basin network they are the mean of pairwise x over species in a given basin (HUC8). To summarize, we construct and analyze both of these kinds of networks (basin network and species networks) to assess the co-dependence between the three quantities, i.e. phylogenetic distance, morphological distance and probability of co-occurrence of species. While the basin network is helpful in understanding the geographical distributions of phylogenetic and morphological diversity, the species networks help us find meaningful species communities, their collective chances of occurrences vis-à-vis their phylogentic and morphological relation. The following section explains the significance of and method for identifying modules in the networks.

Communities in the network
An important concept in a network study is that of a module [56]. A module is a set of nodes that are more densely connected among themselves than to the rest of the network. For example, in a collaboration network, this would mean a group of authors that write papers with each other more frequently than with other authors in the network. For our basin network, a module is a subset of basins that on average share more species within the subset than with the remaining basins. The identification and visualization of these modules can help in understanding which basins tend to cluster together and if the modules are restricted to specific geographical locations on the network. Similarly, a module in a network of species is a group of species that are phylogenetically (if the network is constructed based on PD) or trait-wise (if the network is constructed based on CD) closer to each other, than to the rest of species. To identify modules in the undirected basin network, we use the Louvain algorithm [57] that returns the best partition of the network into modules by optimizing network modularity (Q) [58]. For the species networks, we use Leiden algorithm [59], that also uses modularity optimization and can efficiently identify modules in directed networks. A high value of Q (close to 1) indicates high divisibility of the network into clearly defined modules and vice-versa. Therefore, module compositions of the network are robust representations of the clustering of network nodes when the modularity is high, i.e., close to 1.

Results
Here we present our results from the analysis of basin network first and then from the species networks.

Relation between species phylogenetic distances and trait dissimilarity at the watershed level: Analysis of basin network
Using the basin network, we explore the relation between mean phylogenetic distances and mean trait dissimilarity of species within basins in the context of their geographical occurrence. Due to a higher species richness in the eastern part of the US than the western part (see S1c Fig), the network is denser in the eastern part, signifying a larger number of co-occurrences of species in those basins. On performing the module detection on the basin network, geographically clustered groups of basins are obtained. By definition, basins within a cluster share more species among them than they share with basins in other clusters. The obtained clusters are shown in different colours on the US map in Fig 2(a) (the network edges are lightened for visualization purposes). The clusters on the map reveal geographical basin groups that have larger co-occurrences of species. Additionally, the clusters on the West end (red) and East end (orange) of the map seem to be broadly demarcated from their adjacent clusters by geographical divides that separate watersheds flowing into different oceans (shown by black bold lines on the map).
For each of these basin clusters with high co-occurring species within their basins, we obtain basin-wise means of PD and CD of species within each basin. On plotting mean CD (hCDi) versus mean PD (hPDi), for each of these five basin clusters, we make two interesting observations. The first one supports the correlation between hCDi and hPDi of the species as can be seen from subplots in Fig 2(b) for all the five clusters. The R 2 score from regression The second observation is related to the ranges of hCDi and hPDi for basins within the clusters. These ranges can be read from the Fig 2(b) for all five clusters. We observe that both hCDi and hPDi ranges are towards higher values ([hCDi> = 0.02] and [hPDi> = 200]) for three clusters (1, 2, 4) out of five. This observation becomes clearer on comparison with random clusters. For all five random clusters, shown in Fig 2(c), although we again obtain significant correlation between the hCDi and hPDi, the ranges of the quantities also cover the lowest values of the respective means, i.e. hCDi< = 0.02 and hPDi< = 200, apart from higher values. This is also true for other randomizations (not shown) over basin clusters.
On our dataset, we calculated the phylogenetic signal-a traditionally used method to infer statistical dependency between trait values of species as a consequence of their phylogenetic relation. In Table 1, we present the values of Bloomberg's K and Pagel's λ, which suggest existence of strong phylogenetic signal for native fish species in the US.

Primary determinant of species co-occurrence: Phylogeny or morphology?
In the previous sub-section, we established that basin level hPDi and hCDi are correlated quantities for US freshwater fish species, irrespective of the geographical location of the basin in consideration. The already established robust correlation between hPDi and hCDi hint that it is safe to expect that either both or neither of these quantities explain variance in chances of species co-occurrence. In this sub-section our focus is to analyze PD and CD based species networks to ascertain which among these two quantities is a stronger determinant of species cooccurrence. These networks for CD metric for two nearest neighbour choices, NN = 10 and NN = 50 are shown in S3 Fig, with  To understand if the species co-occurrences are dictated by their trait dissimilarity, i.e., to test hypothesis H 1 as defined in methods section, we obtain the correlation of hCDi and NBS, for the CD-based networks for a range of NN values (NN = 1 to NN = 100). These plots are shown in Fig 3, for NN = 10 (a) and NN = 50 (b) where data is sorted in ascending order of Table 1. Phylogenetic signal of morphological traits. For each morphological trait, the table presents the phylogenetic signal in terms of two metrics-Blomberg's K [11] and Pagel's λ [60], along with their p-values. The ten morphological traits are relative eye size (EdHd), oral gap position (MoBd), relative maxillary length (JlHd), vertical eye position (EhBd), body elongation (BlBd), body laterally shape (HdBd), pectoral fin size (PFiBd), pectoral fin vertical position (PFlBl), caudal peduncle throttling (CFdCP) and maximum body length (Length). NBS. The horizontal dashed blue line is the mean over hCDi of all clusters. For each cluster in the CD-based network, we also obtain the hPDi of species within them. The red dots and the dashed red horizontal line in the same plots shows cluster-wise hPDi and mean over hPDi of all the clusters, respectively. From these two plots, it appears that NBS increases with hCDi with some exceptions. Additionally, we observe that hCDi and the corresponding hPDi fluctuate similarly around their mean values (red and blue horizontal lines). To statistically test the dependence of NBS on hCDi, i.e. to test our hypothesis H 1 , we obtain R 2 coefficient and pvalue with hCDi as the independent variable and NBS as the dependent variable (see values in Table 2). From this analysis, we find that the dependence is significant (p-value < 0.05) for NN = 10 network but not for NN = 50 network. On performing similar test with hCDi as the independent variable and hPDi as the dependent variable, we again find the dependence is significant (p-value < 0.05) for NN = 10 network but not for NN = 50 network.
To check hypothesis H 2 , we do similar analysis with PD-based network, i.e. investigate if hPDi can explain the variance in NBS. The results are shown in Fig 3(c) and 3(d). In this case, we see a clear trend of variation of the quantities among each other for NN = 10 (c) network https://doi.org/10.1371/journal.pone.0287482.g003

Table 2. Summary statistics of linear regression models showing relation among three variables: Number of basins of co-occurrence of at least two species (NBS), phylogenetic distance (PD) and cosine distance (CD) between species.
x and y represent independent and dependent variables, respectively, in the regression analysis. Results for four settings with respect to the metric used of species network construction and the number of nearest neighbours are shown. The R 2 coefficient and the p-values shown here are for a single experiment. and for NN = 50 network. To ascertain the significance of observed relations between the three quantities, we obtain linear regression statistics for these networks in terms of R 2 coefficient and p-value as shown in Table 2 for these particular nearest neighbour choices. For the NN = 10 PD-based network we find that the 21.7% of the variance in NBS (R 2 = 0.217, p < 0.05) is significantly explained by hPDi. For NN = 10 CD-based networks we observe significant correlations between NBS and hCDi (R 2 = 0.412, p < 0.05). The R 2 values for NN = 50 networks are also presented alongside which tell that correlation between NBS and hCDi (for CD based network) and correlation between NBS and hPDi (for PD based network) are not significant. This motivated exploring statistics for a range of NN values for both these network types. Restricting to NN values that lead to higher modularity values of obtained networks and hence result in meaningful modules (see S2 Fig), we chose the range 2 < NN < 100 for both the network types. The statistics for the full range of nearest neighbour choices are presented in plot (a) in Fig 4, which shows R 2 values, with p-value < 0.05 shown by filled circles and pvalue � 0.05 shown using hollow circles. All the results in this figure are averages over 100 runs of module identification algorithm for each NN choice, to account for any randomness in module assignment. From the full range plots (Fig 4(a)), we observe that there exist (almost) distinct ranges of NN values for which the hypotheses H 1 and H 2 can be rejected. For the PD-based networks, this range is 1 � NN � 10, but for the CD-based networks, the range is 10 < NN � 23. Additionally, we observe that for the CD-based network, the significant R 2 values (hCDi explaining variance in NBS) are higher 0.4 < R 2 < 0.6 than the PD-based network where significant R 2 values (hPDi explaining variance in NBS) are 0.0 < R 2 < = 0.2. From the latter observation, we can infer that NBS is more strongly dependent on hCDi than hPDi. Next in our the full range analysis, we return to correlations between hCDi and hPDi. For the CD-based network, we explore if hCDi explains variance in hPDi, and for the PD-based network, we explore how

PLOS ONE
Identifying correlations between phylogeny, morphological traits and occurrence of fish species hPDi explains variance in hCDi. We observe, as shown in plot (b) of the Fig 4, that there is much longer range of NN values where hPDi explains variance in hCDi than the case where hCDi explains variance in hPDi. This observation reflects that species clusters obtained based on phylogenetic closeness tend to incorporate morphologically similar species for a larger range of nearest neighbours.
So far we have not given any biological meaning to modules we identify, nor do we claim that these are the unique representations of phylogenetically close or morphologically similar species. However, to understand if (indirect) evidence points to the relevance of these modules and to check for the validity of our results, we examine the statistics obtained from random modules of species. To this end, we randomly select species and consider them as modules such that they are of the same sizes as the actual modules for the whole NN range that we are interested in. We observe that neither hCDi nor hPDi explains variance in NBS for any of the NN values. On the other hand, both hCDi and hPDi are correlated with each other, but for a smaller range of NN than with analysis using the actual networks' module. The plots for the random module analysis are shown in Fig 4(c) and 4(d). Note that, here too the range of NN for hPDi significantly explaining variance in hCDi is longer than the reverse case. Both these observations argue in favour of the modules we identify and the conclusions we draw from their analysis.

Discussion
In this work, we set out to understand how phylogenetic diversity is related to (or explains) morphological diversity of fish species and if morphology and phylogeny govern the species content of ecological fish communities. We use the phylogenetic distances, morphlological traits, and occurrence information of native fish species in the HUC8 regions of the US. Clearly, the dataset we use in our experiments accounts for a large geographical expanse in terms of species occurrence and also accounts for all native fishes in the US. To the best of our knowledge our study is one of the first to utilize continental level fish data for this domain of research.
Traditionally ecologists have relied on using Phylogenetic Signal of morphological traits to understand statistical dependency between trait values of species as a consequence of their phylogenetic relation. Although some recent studies have pointed out that the use of phylogenetic signal is not always reliable [12,13], it is still the state-of-the-art method used in most ecological studies. Blomberg's K and Pagel's λ calculated on the data suggest existence of strong phylogenetic signal for native fish species in the US. This implies that the morphological distance is statistically explained by phylogentic distance. Our modular analysis on basin network results showed significant R 2 values of (0.70, 0.48, 0.60, 0.70 & 0.60), implying that these percentages of variances in hCDi are explained by mean hPDi in each basin cluster. This result, not only (broadly) confirms existence of strong phylogenetic signal, but also shows how this relation varies in different geographical regions across the US, by design. Through a comparison with random basin clusters, we find that basins (HUC8 regions) in the species rich East end of the map (clusters 1, 4) show stronger statistical dependence between these quantities than the rest of basins. These basins also happen to be those that have only higher values of mean phylogenetic distances and higher mean morphological distances given the full range these quantities take for all the basins. In other words, species in basins with higher co-occurrences (in eastern part of the US) stick to larger hCDi and hPDi ranges, whereas this cannot be strictly said for the western part (3, 5 clusters) of the US. Cluster 5 also has a uniquely high number of basins with high mean morphological distance, which means that a few basins in the west have the most morphologically distinct species. The species spatial network proposed by Layeghifard et al. (2015) [35] incorporates spatial network of sites and species co-occurrence information into a common framework, which is used to infer impact of dispersal on species assemblages. The modular structure of our basin network (Fig 2(a)) captures the impact of species dispersal, which is an important factor governing meta-community structure of species [61]. Specifically, nearby basins, which have higher species similarity in species composition, are spatially clustered together (modules); and the red and the orange clusters are separated from those in the middle (as explained by geographical divides shown in black).
There have been studies which have used co-occurrence as a means to infer the trait matching between species that have antagonistic and mutualistic interactions [62]. Similarly, another study explored how co-occurrence of a plant species in Cape Floristic Region is limited by phylogentic relationship between them [63]. In the present study, we also use species co-occurrence of freshwater fish species as a marker to infer the key determinant of their community structure among the two well-known factors: morphology and phylogeny. Our CD based and PD based species network analysis reveals that both phylogenetic distance and morphological dissimilarity determine the chances of species co-occurrences, but between these two, morphological dissimilarity between species more strongly determines if they co-occur or not than the phylogenetic distance. The latter observation aligns with the results in the study by Winston (1995) [14] on cyprinid fish species in Red river basin in the US. The study reports that morphologically more similar species pairs co-occurred to a lower degree than morphologically less similar species. For our null-hypotheses (see Methods) tests pertaining to primary determinant of species co-occurrence, we constructed networks of species where interactions between species pairs are assigned if phylogenetic and morphological distances are small. As a result, the modules in these networks are determined by and have small phylogenetic and morphological distances between species. For CD based networks, the overall hCDi values are small, as expected, and similarly for PD based networks, the hPDi values of different clusters are small, when compared to the overall ranges CD and PD can take (as shown in histograms for CD and PD in S1a and S1b Fig). Although hCDi and hPDi values of the clusters are low in the respective networks, they vary; and this allows to investigate how the number basins of cooccurrence (NBS) of species within the clusters vary with hCDi and hPDi of the clusters. Hence, on studying of variation of NBS with hCDi and hPDi we could infer statistical dependence between them. Moreover, the analyses of modules identified from the networks of species allows for a more complete understanding of the relationship between morphology, phylogeny and chances of species co-occurrence, as compared to looking at differences in mean values among these quantities. A recent study highlights the usefulness of network modules of fish co-occurrence networks to characterize interspecific relationship [33]. Secondly, our observations on CD and PD based networks are made for specific values of NN, and there are NN values for which the null-hypotheses could not be rejected. The larger NN values where the latter is true, result in networks with smaller number of modules of large sizes, which are also not very discrete, i.e. there are a increased number of inter-modular edges. These large modules although show similar ranges of variation in hCDi and hPDi (as for low NN), the NBS values are restricted to a very small range towards very high values. The smaller number of data points (modules) for higher NN values might have lead to the correlations being not captured with statistical significance. Nevertheless, since meaningful modules pertain to good modularity scores (low NN values), our results still stand firmly where this is true. As a result of decreasing modularity score (poorer modular structure) with increasing NN, we limit to NN values in the range NN = 1 and NN = 100 which results in good modularity scores. Apart from modularity score, the number of clusters decreases with increasing NN, due to merger of the modules (obtained at smaller NN with each other). For example, for CD-based networks for NN = 10, the number of clusters was 11, but for NN = 50, the number of clusters was just 6 (see S3 Fig). Similar situation also occurs for the PD based networks.
To estimate extent of co-occurrence (in species network analyses), we used "basins with at least two of the species in a module" (NBS) as the measure. So for a basin to qualify as a cooccurrence basin for that module of species, it should have at least two of the species in the module, and NBS is the total number of such basins. We did not choose higher percentage of species in the module since there are basins with a very small number species, and this can result in NBS to be zero especially for smaller sized modules. Therefore to avoid the trivial values of NBS being zero and still retaining the definition of co-occurrence, we stick with the bare minimum condition of at least two of the species in the module. We also checked that the results do not qualitatively differ when co-occurrence of a higher percentage of species in the module are used as a criteria.
From our first main observation, we reiterate that although phylogenetic distances explain the morphological dissimilarity between fish species (basins lying close to regression line in Fig  2(b)), which also supports the biodiversity conservation strategies prioritizing phylogenetic diversity as a proxy, there are basins that have higher morphological diversity for a smaller phylogenetic diversity (basins lying far from the regression line in Fig 2(b)), for example in basin cluster 2 in eastern US. Our second observation, of morphology being a stronger determinant of species co-occurrence than phylogeny, also points to phylogeny being an imperfect proxy for biodiversity conservation. A few recent studies [5,6,64] also raise similar arguments.

Conclusion
This work proposes a novel framework for identifying correlations between Phylogeny, morphology, and the co-occurrence of fish species. The complex networks framework, wherein connections are determined through nearest-neighbours, not only allows modeling local node-specific interactions as edges, but these local interactions can cause the overall structure of the network to be comprised of clustered group of species (revealed through module identification) which are functionally meaningful. Through our module-level analyses of speciesspecies networks based on phylogenetic and morphological distances, and the basin network based on number co-occuring species, we uncover the statistical dependence also confirmed by traditional measures like Phylogenetic signal of morphological traits, between fish phylogeny and morphology. Additionally, using this framework, we determine which of the two quantities, fish phylogeny and fish morphology, is a stronger determinant of their co-occurrences. We can extract a few take-home messages from our analysis. Firstly, phylogenetic distance (closeness) in fish species explains morphological distance (closeness) among species. Secondly, basins with a higher species richness have higher mean phylogenetic distances and higher mean morphological distance than basins with smaller richnesses. Thirdly, although both phylogentic distance and trait dissimilarity are significant determinants of number of basins of co-occurrence, morphological distance explains greater degree of variance in number of basins of co-occurrence than phylogentic distance. In summary, our observation points to the important role that evolutionary history of species play in their form and function; and the the role of species morphology in their probability of co-occurrence.